R for physical scientists

Matt Lacey and Taha Ahmed

2: Data manipulation and visualisation

Packages we’ll use today!

ggplot2, dplyr, tidyr

Let’s load them:

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

dplyr

  • A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  • Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.
  • Provide blazing fast performance for in-memory data by writing key pieces in C++.
  • Use the same interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

tidyr

  • Provides functions for “tidying” data - gather() and spread()

Introduction to dplyr

dplyr provides several functions for manipulating data frames, e.g.,

select(), filter(), mutate(), rename()

Remember the mtcars dataset from yesterday?

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

select() selects specific columns from a data frame:

mtcars2 <- select(mtcars, mpg, cyl, disp)
str(mtcars2)
## 'data.frame':    32 obs. of  3 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...

select() can also be used to drop specific columns, with -:

mtcars3 <- select(mtcars, -am, -gear, -carb)
str(mtcars3)
## 'data.frame':    32 obs. of  8 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
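
select() also accepts column ranges and helper functions such as starts_with(). A small sketch on the same mtcars data (the object names mtcars_range and mtcars_d are only for illustration):

```r
library(dplyr)

# A range of adjacent columns, from mpg to disp
mtcars_range <- select(mtcars, mpg:disp)
names(mtcars_range)

# All columns whose names start with "d"
mtcars_d <- select(mtcars, starts_with("d"))
names(mtcars_d)
```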

rename() renames specified columns. Use rename(data, newname = oldname). You can rename as many columns as you want in one rename() call.

mtcars4 <- rename(mtcars, ncyl = cyl, weight = wt)
str(mtcars4)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ ncyl  : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp  : num  160 160 108 258 360 ...
##  $ hp    : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ weight: num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec  : num  16.5 17 18.6 19.4 17 ...
##  $ vs    : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am    : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear  : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb  : num  4 4 1 1 2 1 4 2 2 4 ...

filter() selects rows matching given conditions.

mtcars5 <- filter(mtcars, cyl == 6, am == 1)
mtcars5
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 19.7   6  145 175 3.62 2.770 15.50  0  1    5    6

Multiple conditions can be specified for the same variable.

mtcars6 <- filter(mtcars, hp > 100, hp < 200)
str(mtcars6)
## 'data.frame':    16 obs. of  11 variables:
##  $ mpg : num  21 21 21.4 18.7 18.1 19.2 17.8 16.4 17.3 15.2 ...
##  $ cyl : num  6 6 6 8 6 6 6 8 8 8 ...
##  $ disp: num  160 160 258 360 225 ...
##  $ hp  : num  110 110 110 175 105 123 123 180 180 180 ...
##  $ drat: num  3.9 3.9 3.08 3.15 2.76 3.92 3.92 3.07 3.07 3.07 ...
##  $ wt  : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ qsec: num  16.5 17 19.4 17 20.2 ...
##  $ vs  : num  0 0 1 0 1 1 1 0 0 0 ...
##  $ am  : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 3 3 3 4 4 3 3 3 ...
##  $ carb: num  4 4 1 2 1 4 4 3 3 3 ...

Comparison operators

  • x < y - x is less than y
  • x > y - x is greater than y
  • x <= y - x is less than or equal to y
  • x >= y - x is greater than or equal to y
  • x == y - x is equal to y
  • x != y - x is not equal to y
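
These operators can be combined. As a rough sketch: conditions separated by commas in filter() are combined with AND, which is the same as writing & yourself, and | gives OR. The object names and the cut-off values for wt and hp are made up for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with AND ...
a <- filter(mtcars, hp > 100, hp < 200)
# ... which is the same as writing & explicitly
b <- filter(mtcars, hp > 100 & hp < 200)
all.equal(a, b)

# Use | for OR: cars that are either very light or very powerful
c_or <- filter(mtcars, wt < 2 | hp > 250)
c_or
```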

Value matching

For matching values to a vector, use %in%:

mtcars7 <- filter(mtcars, cyl %in% c(6, 8))
str(mtcars7)
## 'data.frame':    21 obs. of  11 variables:
##  $ mpg : num  21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
##  $ cyl : num  6 6 6 8 6 8 6 6 8 8 ...
##  $ disp: num  160 160 258 360 225 ...
##  $ hp  : num  110 110 110 175 105 245 123 123 180 180 ...
##  $ drat: num  3.9 3.9 3.08 3.15 2.76 3.21 3.92 3.92 3.07 3.07 ...
##  $ wt  : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ qsec: num  16.5 17 19.4 17 20.2 ...
##  $ vs  : num  0 0 1 0 1 0 1 1 0 0 ...
##  $ am  : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 3 3 3 3 4 4 3 3 ...
##  $ carb: num  4 4 1 2 1 4 4 4 3 3 ...
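
For comparison, here is a short sketch of what %in% saves you from typing, and how to negate it with !. The object names are only for illustration.

```r
library(dplyr)

# These two calls keep the same rows
in_set <- filter(mtcars, cyl %in% c(6, 8))
either <- filter(mtcars, cyl == 6 | cyl == 8)
all.equal(in_set, either)

# Negate with ! to keep everything NOT in the set
not_in <- filter(mtcars, !(cyl %in% c(6, 8)))
unique(not_in$cyl)
```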

mutate() adds new columns, preserving all previous ones.

mtcars8 <- mutate(mtcars, displ_l = disp / 61.0237)
str(mtcars8)
## 'data.frame':    32 obs. of  12 variables:
##  $ mpg    : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl    : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp   : num  160 160 108 258 360 ...
##  $ hp     : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat   : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt     : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec   : num  16.5 17 18.6 19.4 17 ...
##  $ vs     : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am     : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear   : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb   : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ displ_l: num  2.62 2.62 1.77 4.23 5.9 ...
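
mutate() can create several columns in one call, and later columns can use ones created earlier in the same call. A minimal sketch (hp_per_l is a made-up column name for illustration):

```r
library(dplyr)

mtcars_power <- mutate(
  mtcars,
  displ_l  = disp / 61.0237,  # cubic inches to litres
  hp_per_l = hp / displ_l     # uses the column created on the line above
)
head(select(mtcars_power, disp, displ_l, hp, hp_per_l), 3)
```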

Note: all dplyr methods ignore row names in data frames (on purpose). If you have them and want to keep them, they have to be converted to an explicit variable. The tibble package (loaded along with the tidyverse) has a function for this, rownames_to_column():

has_rownames(mtcars) # check if rownames exist
## [1] TRUE
mtcars9 <- rownames_to_column(mtcars, "name")
str(mtcars9)
## 'data.frame':    32 obs. of  12 variables:
##  $ name: chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Exercise!

Look at the iris dataset (e.g., View(iris)).

Use filter() to make a new data frame df, for only the “setosa” species with a sepal length between 5 and 6.5 cm.

If you’ve managed this just fine, try out the other functions: rename(), select(), mutate()

Solution!

Your code should look something like this:

df <- filter(iris, Species == "setosa", Sepal.Length >= 5, Sepal.Length <= 6.5)
head(df)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          5.0         3.6          1.4         0.2  setosa
## 3          5.4         3.9          1.7         0.4  setosa
## 4          5.0         3.4          1.5         0.2  setosa
## 5          5.4         3.7          1.5         0.2  setosa
## 6          5.8         4.0          1.2         0.2  setosa

Let’s practice ggplot()

The diamonds dataset

There is a built-in dataset in the ggplot2 package called diamonds.

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

(Try also ?diamonds to find out more.)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

Use aes() to map properties to variables:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) # <<----

Properties not mapped to variables should not be inside aes()

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) # <<----
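
To make the difference between mapping and setting concrete, here is a small sketch with three variants of the same plot. The object names are only for illustration, and nothing is drawn until you print() them.

```r
library(ggplot2)

# Mapped: colour depends on the variable cut, so a legend is drawn
p_mapped <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut))

# Set: every point is blue, and there is no legend
p_set <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue")

# Common mistake: "blue" inside aes() is treated as a variable with a
# single value, so the points get a default colour and a legend entry
# labelled "blue"
p_mistake <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = "blue"))
```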

variable <- plot

p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) # <<----

print(p)

ggplots are lists

str(p)
## List of 9
##  $ data       :Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of  10 variables:
##   ..$ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##   ..$ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##   ..$ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##   ..$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##   ..$ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##   ..$ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##   ..$ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##   ..$ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##   ..$ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##   ..$ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ layers     :List of 1
##   ..$ :Classes 'LayerInstance', 'Layer', 'ggproto' <ggproto object: Class LayerInstance, Layer>
##     aes_params: list
##     compute_aesthetics: function
##     compute_geom_1: function
##     compute_geom_2: function
##     compute_position: function
##     compute_statistic: function
##     data: waiver
##     draw_geom: function
##     geom: <ggproto object: Class GeomPoint, Geom>
##         aesthetics: function
##         default_aes: uneval
##         draw_group: function
##         draw_key: function
##         draw_layer: function
##         draw_panel: function
##         extra_params: na.rm
##         handle_na: function
##         non_missing_aes: size shape
##         parameters: function
##         required_aes: x y
##         setup_data: function
##         use_defaults: function
##         super:  <ggproto object: Class Geom>
##     geom_params: list
##     inherit.aes: TRUE
##     layer_data: function
##     mapping: uneval
##     map_statistic: function
##     position: <ggproto object: Class PositionIdentity, Position>
##         compute_layer: function
##         compute_panel: function
##         required_aes: 
##         setup_data: function
##         setup_params: function
##         super:  <ggproto object: Class Position>
##     print: function
##     show.legend: NA
##     stat: <ggproto object: Class StatIdentity, Stat>
##         compute_group: function
##         compute_layer: function
##         compute_panel: function
##         default_aes: uneval
##         extra_params: na.rm
##         non_missing_aes: 
##         parameters: function
##         required_aes: 
##         retransform: TRUE
##         setup_data: function
##         setup_params: function
##         super:  <ggproto object: Class Stat>
##     stat_params: list
##     subset: NULL
##     super:  <ggproto object: Class Layer> 
##  $ scales     :Classes 'ScalesList', 'ggproto' <ggproto object: Class ScalesList>
##     add: function
##     clone: function
##     find: function
##     get_scales: function
##     has_scale: function
##     input: function
##     n: function
##     non_position_scales: function
##     scales: list
##     super:  <ggproto object: Class ScalesList> 
##  $ mapping    :List of 2
##   ..$ x: symbol carat
##   ..$ y: symbol price
##  $ theme      : list()
##  $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto' <ggproto object: Class CoordCartesian, Coord>
##     aspect: function
##     distance: function
##     expand: TRUE
##     is_linear: function
##     labels: function
##     limits: list
##     range: function
##     render_axis_h: function
##     render_axis_v: function
##     render_bg: function
##     render_fg: function
##     train: function
##     transform: function
##     super:  <ggproto object: Class CoordCartesian, Coord> 
##  $ facet      :List of 1
##   ..$ shrink: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "null" "facet"
##  $ plot_env   :<environment: R_GlobalEnv> 
##  $ labels     :List of 3
##   ..$ x     : chr "carat"
##   ..$ y     : chr "price"
##   ..$ colour: chr "cut"
##  - attr(*, "class")= chr [1:2] "gg" "ggplot"

adding layers

p <- p + ggtitle("diamonds")

print(p)

Scaling

There are a very large number of functions for modifying properties mapped to variables, such as x, y, size, shape, alpha, color, fill, etc.

They all begin with scale_

Scaling axes can be done with, for example, scale_x_continuous() and scale_y_continuous(). Setting limits in these functions removes values outside the range…

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) +
  scale_x_continuous(limits = c(0, 3)) + # <<----
  scale_y_continuous(limits = c(0, 10000)) # <<----

Another alternative is to use coord_cartesian(), which effectively rescales the plot window without removing any data - this can be useful sometimes:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) +
  coord_cartesian(xlim = c(0, 3), ylim = c(0, 10000)) # <<----
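
The difference between the two approaches is easier to see on a tiny data set. The sketch below uses a made-up data frame, toy, and layer_data() to inspect the data ggplot2 has prepared for drawing. With scale limits the out-of-range x values have been replaced by NA; with coord_cartesian() they are all still there, just outside the visible window.

```r
library(ggplot2)

toy <- data.frame(x = 1:10, y = (1:10)^2)  # made-up data

p_scale <- ggplot(toy, aes(x, y)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 5))

p_coord <- ggplot(toy, aes(x, y)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 5))

# Count x values that were turned into NA before drawing
n_na_scale <- sum(is.na(layer_data(p_scale)$x))
n_na_coord <- sum(is.na(layer_data(p_coord)$x))
c(scale = n_na_scale, coord = n_na_coord)
```

This matters most when a layer calculates something from the data (a fit or a summary, for example), because scale limits change what goes into that calculation.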

There are useful functions for changing colour schemes based on specially suited colour palettes.

  • scale_color_brewer() and scale_fill_brewer() for discrete data
  • scale_color_distiller() and scale_fill_distiller() for continuous data

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) +
  scale_color_brewer(palette = "Set1") # <<----

Legend title and labels can be changed from within the scale_ function

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) +
  scale_color_brewer("Grade", palette = "Set1", labels = c("E", "D", "C", "B", "A")) # <<----

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = x * y), size = 2, alpha = 0.6) +
  scale_color_distiller(palette = "YlGnBu", limits = c(10, 120)) # <<----

… and many more possibilities

  • scale_x_log10()
  • scale_size_discrete()
  • scale_fill_continuous()
  • scale_alpha_manual()

and so on…

I recommend reading the ggplot2 documentation to learn more!

Labelling

Use xlab() and ylab() to label axes:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6) +
  scale_x_continuous(limits = c(0, 3)) + 
  scale_y_continuous(limits = c(0, 10000)) +
  xlab("weight / carats") + # <<----
  ylab("price / USD") # <<----

Facetting

diamonds2 <- filter(diamonds, cut %in% c("Premium", "Ideal"), 
                    clarity %in% c("VVS1", "VVS2", "IF"))

p2 <- ggplot(diamonds2, aes(x = carat, y = price)) +
  geom_point(aes(color = color), size = 2, alpha = 0.5)

print(p2)

Create panels for each value of a variable with facet_grid(rows ~ columns)

p2 + facet_grid(. ~ cut)

p2 + facet_grid(clarity ~ cut)

Use facet_wrap() if you have a 1-dimensional sequence of panels and want to wrap it into a fixed number of rows or columns:

p2 + facet_wrap( ~ color, ncol = 4)

a note about factors:

A factor is a categorical variable - it is only allowed to have certain values (levels).

df_a contains battery testing data for cycles 5, 10, 20 of a battery.

str(df_a)
## 'data.frame':    1601 obs. of  7 variables:
##  $ step.n: int  1 1 1 1 1 2 2 2 2 2 ...
##  $ step.t: num  2.03 4.06 6.08 8.11 10.01 ...
##  $ cyc.n : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ I     : num  0 0 0 0 0 ...
##  $ E     : num  2.58 2.57 2.56 2.55 2.54 ...
##  $ Q.d   : num  NA NA NA NA NA ...
##  $ Q.c   : num  NA NA NA NA NA NA NA NA NA NA ...

If I plot voltage E vs charge (Q.d and Q.c) directly, colouring lines according to cycle number cyc.n:

ggplot(df_a) +
  geom_path(aes(x = Q.d, y = E, color = cyc.n)) +
  geom_path(aes(x = Q.c, y = E, color = cyc.n))

cyc.n, as a number, is assumed to be continuous data, and the colour scale is a gradient by default for this reason. I can get the behaviour I want by converting cyc.n to a factor directly in the plot:

ggplot(df_a) +
  geom_path(aes(x = Q.d, y = E, color = factor(cyc.n))) +
  geom_path(aes(x = Q.c, y = E, color = factor(cyc.n)))

I can also modify the data:

df_a$cyc.n <- factor(df_a$cyc.n, levels = c(5, 10, 20))
str(df_a)
## 'data.frame':    1601 obs. of  7 variables:
##  $ step.n: int  1 1 1 1 1 2 2 2 2 2 ...
##  $ step.t: num  2.03 4.06 6.08 8.11 10.01 ...
##  $ cyc.n : Factor w/ 3 levels "5","10","20": 1 1 1 1 1 1 1 1 1 1 ...
##  $ I     : num  0 0 0 0 0 ...
##  $ E     : num  2.58 2.57 2.56 2.55 2.54 ...
##  $ Q.d   : num  NA NA NA NA NA ...
##  $ Q.c   : num  NA NA NA NA NA NA NA NA NA NA ...
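
The order of the levels decides the order in legends and facets, so it is worth checking. A short base R sketch with made-up values (not the df_a data above):

```r
# Numeric input: levels are sorted numerically
lev_num <- levels(factor(c(20, 5, 10)))
lev_num

# Character input: levels are sorted alphabetically, which is rarely
# the order you want for numbers stored as text
lev_chr <- levels(factor(c("20", "5", "10")))
lev_chr

# Setting levels explicitly fixes the order
cyc <- factor(c("20", "5", "10"), levels = c("5", "10", "20"))
levels(cyc)

# Converting back to numbers: go via as.character(), because
# as.numeric() on a factor returns the position of each level
values  <- as.numeric(as.character(cyc))
indices <- as.numeric(cyc)
```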

geoms

There are many - read the ggplot2 documentation!

Most useful:

  • geom_point(), geom_path(), geom_line(), geom_bar()

themes

p2 + facet_wrap( ~ color, ncol = 4) +
  theme_bw()

p2 + facet_wrap( ~ color, ncol = 4) +
  theme_classic()

p2 + facet_wrap( ~ color, ncol = 4) +
  theme_minimal()

theme_Lacey <- function(base_size=15, base_family="Lato Medium") {
  library(grid)
  library(ggthemes)
  (theme_foundation(base_size=base_size, base_family=base_family)
  + theme(plot.title = element_text(size = rel(1.2), hjust = 0.5),
          text = element_text(),
          panel.background = element_rect(colour=NA),
          plot.background = element_rect(fill = "transparent", colour=NA),
          panel.border = element_rect(colour = NA),
          axis.title = element_text(size = rel(1), colour="#333333", family = "Lato"),
          axis.title.y = element_text(angle=90, colour="#333333", family = "Lato"),
          axis.text = element_text(size = rel(0.8)), 
          axis.line.x = element_line(size=0.5, colour="#333333"),
          axis.ticks.length=unit(-0.15, "cm"),
          axis.text.x = element_text(margin = margin(0.5, 0, 0.2, 0, "cm"), colour="#666666"),
          axis.text.y = element_text(margin = margin(0, 0.5, 0, 0.2, "cm"), colour="#666666"),
          panel.grid.major = element_line(colour="#eaeaea", size = 0.5),
          panel.grid.minor = element_blank(),
          legend.key = element_rect(colour = NA),
          legend.key.size = unit(0.6, "cm"),
          legend.margin = margin(0, 0, 0, 0, "cm"), # legend.margin is documented as taking a margin() object
          strip.background=element_rect(colour="#eaeaea",fill="#eaeaea"),
          strip.text = element_text(family = "Lato", 
                                    colour = "#333333", lineheight=0.7),
          legend.text = element_text(family = "Lato", colour = "#333333")
  ))
  
}

p2 + facet_wrap( ~ color, ncol = 4) +
  theme_Lacey()
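
You do not need a full custom theme to change a detail or two. A minimal sketch, assuming you start from one of the built-in themes and then override single elements with theme() (the plot and the choices here are only for illustration):

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  theme_bw(base_size = 14) +                   # start from a complete theme
  theme(panel.grid.minor = element_blank(),    # then override single elements
        legend.position = "bottom")
```

The order matters: a complete theme such as theme_bw() added after theme() would overwrite the individual settings.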

dplyr -> ggplot

%>%

%>% is the “pipe” operator.

It comes from the magrittr package and is loaded automatically along with dplyr/tidyverse.

%>% “pipes” an object to the first argument of a function, i.e:

x %>% f(y, z)

is the same as:

f(x, y, z)

This creates code which is easily read left-to-right.

For example:

diamonds %>%
  ggplot(aes(x = carat, y = price)) +
    geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6)

filter(diamonds, cut %in% c("Premium", "Ideal")) %>%
  ggplot(aes(x = carat, y = price)) +
    geom_point(aes(color = cut), size = 2, shape = 21, alpha = 0.6)

In cases where you don’t want the object on the left hand side to be the first argument in the function, use the dot (.) as a placeholder:

y %>% f(x, ., z)

is equivalent to f(x, y, z)
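
A small sketch of both forms with real functions. lm() is a handy case for the dot, because its data argument is not the first one.

```r
library(dplyr)  # makes %>% available

# Nested calls are read inside-out ...
nested <- sum(sqrt(c(1, 4, 9)))
# ... the piped version is read left to right
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# lm() takes the formula first and the data second,
# so the dot marks where mtcars should go
fit <- mtcars %>% lm(mpg ~ wt, data = .)
coef(fit)
```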

But there will be more of this tomorrow!

gather() and spread()

mean_iris <- iris %>%
  group_by(Species) %>%
  summarise_all(mean)

mean_iris
## # A tibble: 3 × 5
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##       <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
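
group_by() and summarise_all() have not been introduced yet, so here is a rough sketch of the idea using summarise() with explicitly named summaries: group_by() splits the data into groups, and summarise() reduces each group to one row. The column names n, mean_sepal_length and sd_sepal_length are made up for illustration.

```r
library(dplyr)

iris_summary <- iris %>%
  group_by(Species) %>%                      # one group per species
  summarise(
    n = n(),                                 # number of rows in each group
    mean_sepal_length = mean(Sepal.Length),
    sd_sepal_length   = sd(Sepal.Length)
  )
iris_summary
```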

Suppose I want to plot each of these flower attributes on a chart vs Species?

This is the wrong way to do it:

mean_iris %>%
  ggplot(aes(x = Species)) +
    geom_point(aes(y = Sepal.Length), color = "red") +
    geom_point(aes(y = Sepal.Width), color = "blue") +
    geom_point(aes(y = Petal.Length), color = "black") +
    geom_point(aes(y = Petal.Width), color = "dark green")

Data should be converted so that data points are tabulated as key-value pairs. This is what the gather() function is for:

long_iris <- gather(mean_iris, key = flower_att, value = measurement, -Species)
head(long_iris)
## # A tibble: 6 × 3
##      Species   flower_att measurement
##       <fctr>        <chr>       <dbl>
## 1     setosa Sepal.Length       5.006
## 2 versicolor Sepal.Length       5.936
## 3  virginica Sepal.Length       6.588
## 4     setosa  Sepal.Width       3.428
## 5 versicolor  Sepal.Width       2.770
## 6  virginica  Sepal.Width       2.974

long_iris %>%
  ggplot(aes(x = Species, y = measurement, color = flower_att)) +
    geom_point()

spread() does the opposite of gather():

wide_iris <- spread(long_iris, key = flower_att, value = measurement)
str(wide_iris)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3 obs. of  5 variables:
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
##  $ Petal.Length: num  1.46 4.26 5.55
##  $ Petal.Width : num  0.246 1.326 2.026
##  $ Sepal.Length: num  5.01 5.94 6.59
##  $ Sepal.Width : num  3.43 2.77 2.97
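
Newer versions of tidyr (1.0.0 and later, as far as I know) also provide pivot_longer() and pivot_wider(), which do the same jobs as gather() and spread(). A sketch on a small made-up table, assuming such a version is installed:

```r
library(tidyr)

# Made-up wide table: one row per sample, one column per attribute
wide <- data.frame(
  sample = c("a", "b"),
  length = c(1.5, 2.5),
  width  = c(0.5, 0.7),
  stringsAsFactors = FALSE
)

# Wide to long: like gather()
long <- pivot_longer(wide, cols = -sample,
                     names_to = "attribute", values_to = "measurement")
long

# Long to wide: like spread()
back <- pivot_wider(long, names_from = attribute, values_from = measurement)
back
```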

Saving plots

ggsave()

p1 <- long_iris %>%
  ggplot(aes(x = Species, y = measurement, color = flower_att)) +
    geom_point()

ggsave("p1.png", plot = p1, width = 6, height = 4, units = "in", dpi = 300)

ggsave() works out the format to save as from the file extension. It accepts .eps/.ps, .tex (pictex), .pdf, .jpeg, .tiff, .png, .bmp, .svg and (only on Windows) .wmf.
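
For example, changing only the file extension is enough to get a PDF instead of a PNG. A minimal sketch with a made-up plot; tempdir() is used only so that the example does not clutter the working directory.

```r
library(ggplot2)

p_demo <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# The .pdf extension tells ggsave() to write a PDF
pdf_file <- file.path(tempdir(), "p_demo.pdf")
ggsave(pdf_file, plot = p_demo, width = 6, height = 4, units = "in")

file.exists(pdf_file)
```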

That’s all for today!

Key functions:

  • from dplyr: select(), filter(), mutate(), rename()
  • from ggplot2: ggplot(), aes(), geom_*(), scale_*(), xlab(), ylab(), ggtitle(), theme_*(), ggsave()
  • from tidyr: gather(), spread()
  • %>%